Analysis: The second graph is easier to analysis than compared to first one, the perception problem highlighted here is the difference in the channel capacity of human preception. Its difficult to distinguish between different shades of blue while its easier to distinguish between differnet hues (channel capacity is here). Eg: The difference in values say at 400 to 900 is very similar in first graph.
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point(aes(colour = cut_interval(linolenic, n = 4))) + ggtitle("Scatter plot of palmitic vs. oleic coloured by 4 intervals of linolenic")
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point(aes(size = cut_interval(linolenic, n = 4))) + ggtitle("Scatter plot of palmitic vs. oleic sized by 4 intervals of linolenic")
## Warning: Using size for a discrete variable is not advised.
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point() + geom_spoke(aes(angle = as.numeric(cut_interval(linolenic, n = 4))*10), radius = 50) + ggtitle("Scatter plot of palmitic vs. oleic spoke angles by 4 intervals of linolenic")
Analysis: The discretized Linolenic with color was the easiest to detect boundary. This is becuase the channel capacity of detection is in the order of color>direction>size
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = as.numeric(Region))) + ggtitle("Scatter plot of palmitic vs. oleic, coloured by region as number")
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = Region)) + ggtitle("Scatter plot of palmitic vs. oleic, coloured by region as factors")
Analysis: Using a factor simply as number assumes that the difference between region is an increment of one, eg: Cat->Dog->Human the difference is one unit among the three, while treating them as factor does not assuming any step increment. The Preattentive pattern emerged here due to distinct colour.
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = cut_interval(linoleic, n = 3), shape = cut_interval(palmitic, n = 3), size = cut_interval(palmitoleic, n = 3)))
## Warning: Using size for a discrete variable is not advised.
Analysis: Its very hard to distinguish between 27 types of combination due to no clear boundary between the regions, the perception problem the channel capacity does not follow mathematical summation that channel capacity of an object where there are three shades of colour + three types of shapes != channel capacity of 3 shapes + channel capacity of 3 colours in a chart.
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = Region, shape = cut_interval(palmitic, n = 3), size = cut_interval(palmitoleic, n = 3)))
## Warning: Using size for a discrete variable is not advised.
Analysis: According to Treisman’s theory, in case of perceiving a stimulus, features are “registered early, automatically, and in parallel, while objects are identified separately” and at a later stage in processing. The explaination is as follows: Any object’s feature will be processed parallely and serially, if an object can be split with reasonable boundary (say shape) then our brain process the individual features seperately, thus identify a boundary becomes easy in this chart compared to above.
olive_data_sum <- olive_data %>% group_by(Region) %>% summarise(palmitic = sum(palmitic),
palmitoleic = sum(palmitoleic),
stearic = sum(stearic),
oleic = sum(oleic),
linoleic = sum(linoleic),
linolenic = sum(linolenic),
arachidic = sum(arachidic),
eicosenoic = sum(eicosenoic))
olive_data_sum_long <- melt(olive_data_sum, id.vars = c("Region"))
olive_data_sum_long %>% group_by(Region) %>% plot_ly(labels = ~variable, values = ~value,
type = 'pie', showlegend = FALSE) %>% layout(title = 'Total oils by region',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
Analysis: The plot does not conform to the practices under good visualization, it is generally not advised to use piechart because angles are harder to detect than size. Not having clear lables/legend simply adds to more overhead to process the information
ggplot(olive_data, aes(x = linoleic, y = eicosenoic)) + geom_point(aes(colour = Region)) + geom_density_2d()
ggplot(olive_data, aes(x = linoleic, y = eicosenoic)) + geom_point(aes(colour = Region))
Analysis: As evident from the two plots the density plot suggests a clusters being formed while the simple scatter plots coloured by region suggests a simpler cluster
baseball <- read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1")
Analysis: Yes its reasonable to scale the data (perform dimensionality reduction) since more than visualizing more than 4 feaures is not advised.
baseball_scaled <- scale(baseball[,3:length(baseball)])
distance <- dist(baseball_scaled)
fit <- isoMDS(distance, k = 2, maxit = 100)
## initial value 19.856833
## iter 5 value 16.319153
## iter 10 value 16.046215
## final value 15.935476
## converged
baseball$MDS1 <- fit$points[,1]
baseball$MDS2 <- fit$points[,2]
# plotly plot
baseball %>% plot_ly(type = "scatter",
mode = "markers",
colors = "Set1",
showlegend = T) %>% add_trace(x = ~MDS1, y = ~MDS2, color = ~League, text = ~Team) %>% layout(title = "Non Metric MDS plot")
Analysis: From the scatterplot, we see that there is some difference between the 2 leagues. More teams from NL league are at the bottom of the plot as compared to teams from the AL league. Most of the teams from the AL league are concentrated in the central portion in the upper part of the plot (MDS_V1 > -3, MDS_V1 < 4, MDS_V2 > -1.5 ). Among the 2 MDS variables, MDS_V2 appears to differentiate the 2 leagues the best since it is possible to observe a boundary around MDS_V2 = -1. It is difficult to observe such a boundary using the MDS_V1 variable.
Outliers: Based on the plot, Boston Red Sox seems like an outlier. While there are other teams also which lie in the other extremes, they are not isolated and hence, do not appear like outliers.
sh <- Shepard(distance, fit$points)
Original_Dist <-as.numeric(distance)
MDS_Dist <- as.numeric(dist(fit$points))
rownames(baseball) <- baseball$Team
n=nrow(fit$points)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])
n=nrow(fit$points)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])
plot_ly()%>%
add_markers(x=~Original_Dist, y=~MDS_Dist, hoverinfo = 'text', name = "Datapoints",
text = ~paste('Obj1: ', rownames(baseball)[index1],
'<br> Obj 2: ', rownames(baseball)[index2]))%>%
#if nonmetric MDS inolved
add_lines(x=~sh$x, y=~sh$yf, name="Stress")
Analysis: The mapping line in the Shepard’s plot looks almost linear and apart from a few points, most of them are crowded close to this line. So, it seems like the stress involved in the MDS operation must have been relatively low. So, it appears like the MDS operation was successful.
While mapping observations in Non-metric MDS, if the points lie on a straight line, then it is easiest to map the observations. And those points which are farthest from the regression line of D and delta are the hardest to map. Based on this logic, we can see that observation pairs -> Minnesota twins & Arizonna Diamondbacks, Oakland Athlatics and Milwakee Brewers. NY Mets and Minnesota Twins were the most difficult to map as they lie farthest from the regression line.
MDS_plotter <- function(MDS_axis, Col){
df <- cbind(baseball[[MDS_axis]], baseball[[Col]], baseball$League, baseball$Team)
df <- as.data.frame(df)
names(df) <- c(eval(MDS_axis),eval(Col), "League", "Team")
df$League <- as.character(df$League)
# plotly plot
y_title <- list(title = Col)
x_title <- list(title = MDS_axis)
plot_return <- df %>% plot_ly(type = "scatter", mode = "markers", colors = "Set1", showlegend = T) %>%
add_trace(x = df[,1], y = df[,2], color = ~League, text = ~Team) %>%
layout(title = "Plot of MDS axis vs. variable", yaxis =y_title, xaxis=x_title)
return(plot_return)
}
# Plotting MDS on all numeric columns
# MDS axis 1
# MDS_plotter("MDS1", "Won")
# MDS_plotter("MDS1", "Lost")
# MDS_plotter("MDS1", "Runs.per.game")
# MDS_plotter("MDS1", "HR.per.game")
# MDS_plotter("MDS1", "AB")
# MDS_plotter("MDS1", "Runs")
# MDS_plotter("MDS1", "Hits")
# MDS_plotter("MDS1", "X2B")
# MDS_plotter("MDS1", "X3B")
# MDS_plotter("MDS1", "HR")
# MDS_plotter("MDS1", "RBI")
# MDS_plotter("MDS1", "StolenB")
# MDS_plotter("MDS1", "CaughtS")
# MDS_plotter("MDS1", "BB")
# MDS_plotter("MDS1", "SO")
# MDS_plotter("MDS1", "BAvg")
# MDS_plotter("MDS1", "OBP")
# MDS_plotter("MDS1", "SLG")
# MDS_plotter("MDS1", "OPS")
# MDS_plotter("MDS1", "TB")
# MDS_plotter("MDS1", "GDP")
# MDS_plotter("MDS1", "HBP")
# MDS_plotter("MDS1", "SH")
# MDS_plotter("MDS1", "SF")
# MDS_plotter("MDS1", "IBB")
# MDS_plotter("MDS1", "LOB")
#
# # MDS axis 2
# MDS_plotter("MDS2", "Won")
# MDS_plotter("MDS2", "Lost")
# MDS_plotter("MDS2", "Runs.per.game")
# MDS_plotter("MDS2", "HR.per.game")
# MDS_plotter("MDS2", "AB")
# MDS_plotter("MDS2", "Runs")
# MDS_plotter("MDS2", "Hits")
# MDS_plotter("MDS2", "X2B")
# MDS_plotter("MDS2", "X3B")
# MDS_plotter("MDS2", "HR")
# MDS_plotter("MDS2", "RBI")
# MDS_plotter("MDS2", "StolenB")
# MDS_plotter("MDS2", "CaughtS")
# MDS_plotter("MDS2", "BB")
# MDS_plotter("MDS2", "SO")
# MDS_plotter("MDS2", "BAvg")
# MDS_plotter("MDS2", "OBP")
# MDS_plotter("MDS2", "SLG")
# MDS_plotter("MDS2", "OPS")
# MDS_plotter("MDS2", "TB")
# MDS_plotter("MDS2", "GDP")
# MDS_plotter("MDS2", "HBP")
# MDS_plotter("MDS2", "SH")
# MDS_plotter("MDS2", "SF")
# MDS_plotter("MDS2", "IBB")
# MDS_plotter("MDS2", "LOB")
# Best plots are with MDS-2 with variables HR.per.game and X3B
MDS_plotter("MDS2", "HR.per.game")
MDS_plotter("MDS2", "X3B")
Analysis: According to Wikipedia, a home run (HR) is scored when the ball is hit in such a way that the batter is able to circle the bases and reach home safely in one play without any errors being committed by the defensive team in the process. According to Wikipedia, a triple (3B) is the act of a batter safely reaching third base after hitting the ball, with neither the benefit of a fielder’s misplay nor another runner being put out on a fielder’s choice. Both the variables are very important in scoring baseball teams as they are the most important way for a batting team to score runs. The MDS_V2 variable is positively correlated with Home runs and negatively correlated with Triples. Both the variables depict the batting characteristics of a team. So, MDS_V2 variable represents the batting characteristics of a team and probably denotes the ability of teams to score more home runs compared to triples.
knitr::opts_chunk$set(echo = FALSE)
library(ggplot2)
library(dplyr)
library(reshape2)
library(MASS)
library(plotly)
library(xlsx)
olive_data <- read.csv("olive.csv")
olive_data <- olive_data[,-1]
olive_data$Region <- factor(olive_data$Region, levels = c(1,2,3))
knitr::opts_chunk$set(echo = TRUE)
ggplot(data = olive_data, aes(x = oleic, y = palmitic), colour = linolenic) + geom_point(aes(colour = linoleic)) + ggtitle("Scatter plot of palmitic vs. oleic coloured by linolenic")
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point(aes(colour = cut_interval(linoleic, n = 4))) + ggtitle("Scatter plot of palmitic vs. oleic coloured by 4 intervals of linolenic")
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point(aes(colour = cut_interval(linolenic, n = 4))) + ggtitle("Scatter plot of palmitic vs. oleic coloured by 4 intervals of linolenic")
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point(aes(size = cut_interval(linolenic, n = 4))) + ggtitle("Scatter plot of palmitic vs. oleic sized by 4 intervals of linolenic")
ggplot(data = olive_data, aes(x = oleic, y = palmitic)) + geom_point() + geom_spoke(aes(angle = as.numeric(cut_interval(linolenic, n = 4))*10), radius = 50) + ggtitle("Scatter plot of palmitic vs. oleic spoke angles by 4 intervals of linolenic")
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = as.numeric(Region))) + ggtitle("Scatter plot of palmitic vs. oleic, coloured by region as number")
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = Region)) + ggtitle("Scatter plot of palmitic vs. oleic, coloured by region as factors")
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = cut_interval(linoleic, n = 3), shape = cut_interval(palmitic, n = 3), size = cut_interval(palmitoleic, n = 3)))
ggplot(data = olive_data, aes(x = oleic, y = eicosenoic)) + geom_point(aes(colour = Region, shape = cut_interval(palmitic, n = 3), size = cut_interval(palmitoleic, n = 3)))
olive_data_sum <- olive_data %>% group_by(Region) %>% summarise(palmitic = sum(palmitic),
palmitoleic = sum(palmitoleic),
stearic = sum(stearic),
oleic = sum(oleic),
linoleic = sum(linoleic),
linolenic = sum(linolenic),
arachidic = sum(arachidic),
eicosenoic = sum(eicosenoic))
olive_data_sum_long <- melt(olive_data_sum, id.vars = c("Region"))
olive_data_sum_long %>% group_by(Region) %>% plot_ly(labels = ~variable, values = ~value,
type = 'pie', showlegend = FALSE) %>% layout(title = 'Total oils by region',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
ggplot(olive_data, aes(x = linoleic, y = eicosenoic)) + geom_point(aes(colour = Region)) + geom_density_2d()
ggplot(olive_data, aes(x = linoleic, y = eicosenoic)) + geom_point(aes(colour = Region))
baseball <- read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1")
baseball_scaled <- scale(baseball[,3:length(baseball)])
distance <- dist(baseball_scaled)
fit <- isoMDS(distance, k = 2, maxit = 100)
baseball$MDS1 <- fit$points[,1]
baseball$MDS2 <- fit$points[,2]
# plotly plot
baseball %>% plot_ly(type = "scatter",
mode = "markers",
colors = "Set1",
showlegend = T) %>% add_trace(x = ~MDS1, y = ~MDS2, color = ~League, text = ~Team) %>% layout(title = "Non Metric MDS plot")
sh <- Shepard(distance, fit$points)
Original_Dist <-as.numeric(distance)
MDS_Dist <- as.numeric(dist(fit$points))
rownames(baseball) <- baseball$Team
n=nrow(fit$points)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])
n=nrow(fit$points)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])
plot_ly()%>%
add_markers(x=~Original_Dist, y=~MDS_Dist, hoverinfo = 'text', name = "Datapoints",
text = ~paste('Obj1: ', rownames(baseball)[index1],
'<br> Obj 2: ', rownames(baseball)[index2]))%>%
#if nonmetric MDS inolved
add_lines(x=~sh$x, y=~sh$yf, name="Stress")
MDS_plotter <- function(MDS_axis, Col){
df <- cbind(baseball[[MDS_axis]], baseball[[Col]], baseball$League, baseball$Team)
df <- as.data.frame(df)
names(df) <- c(eval(MDS_axis),eval(Col), "League", "Team")
df$League <- as.character(df$League)
# plotly plot
y_title <- list(title = Col)
x_title <- list(title = MDS_axis)
plot_return <- df %>% plot_ly(type = "scatter", mode = "markers", colors = "Set1", showlegend = T) %>%
add_trace(x = df[,1], y = df[,2], color = ~League, text = ~Team) %>%
layout(title = "Plot of MDS axis vs. variable", yaxis =y_title, xaxis=x_title)
return(plot_return)
}
# Plotting MDS on all numeric columns
# MDS axis 1
# MDS_plotter("MDS1", "Won")
# MDS_plotter("MDS1", "Lost")
# MDS_plotter("MDS1", "Runs.per.game")
# MDS_plotter("MDS1", "HR.per.game")
# MDS_plotter("MDS1", "AB")
# MDS_plotter("MDS1", "Runs")
# MDS_plotter("MDS1", "Hits")
# MDS_plotter("MDS1", "X2B")
# MDS_plotter("MDS1", "X3B")
# MDS_plotter("MDS1", "HR")
# MDS_plotter("MDS1", "RBI")
# MDS_plotter("MDS1", "StolenB")
# MDS_plotter("MDS1", "CaughtS")
# MDS_plotter("MDS1", "BB")
# MDS_plotter("MDS1", "SO")
# MDS_plotter("MDS1", "BAvg")
# MDS_plotter("MDS1", "OBP")
# MDS_plotter("MDS1", "SLG")
# MDS_plotter("MDS1", "OPS")
# MDS_plotter("MDS1", "TB")
# MDS_plotter("MDS1", "GDP")
# MDS_plotter("MDS1", "HBP")
# MDS_plotter("MDS1", "SH")
# MDS_plotter("MDS1", "SF")
# MDS_plotter("MDS1", "IBB")
# MDS_plotter("MDS1", "LOB")
#
# # MDS axis 2
# MDS_plotter("MDS2", "Won")
# MDS_plotter("MDS2", "Lost")
# MDS_plotter("MDS2", "Runs.per.game")
# MDS_plotter("MDS2", "HR.per.game")
# MDS_plotter("MDS2", "AB")
# MDS_plotter("MDS2", "Runs")
# MDS_plotter("MDS2", "Hits")
# MDS_plotter("MDS2", "X2B")
# MDS_plotter("MDS2", "X3B")
# MDS_plotter("MDS2", "HR")
# MDS_plotter("MDS2", "RBI")
# MDS_plotter("MDS2", "StolenB")
# MDS_plotter("MDS2", "CaughtS")
# MDS_plotter("MDS2", "BB")
# MDS_plotter("MDS2", "SO")
# MDS_plotter("MDS2", "BAvg")
# MDS_plotter("MDS2", "OBP")
# MDS_plotter("MDS2", "SLG")
# MDS_plotter("MDS2", "OPS")
# MDS_plotter("MDS2", "TB")
# MDS_plotter("MDS2", "GDP")
# MDS_plotter("MDS2", "HBP")
# MDS_plotter("MDS2", "SH")
# MDS_plotter("MDS2", "SF")
# MDS_plotter("MDS2", "IBB")
# MDS_plotter("MDS2", "LOB")
# Best plots are with MDS-2 with variables HR.per.game and X3B
MDS_plotter("MDS2", "HR.per.game")
MDS_plotter("MDS2", "X3B")